Moritz Hardt

Good Allocations from Bad Estimates

Jan 09, 2026, via arXiv

Scaling Open-Ended Reasoning to Predict the Future

Dec 31, 2025, via arXiv

Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Oct 06, 2025, via arXiv

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Jul 03, 2025, via arXiv

How Benchmark Prediction from Fewer Data Misses the Mark

Jun 09, 2025, via arXiv

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

Oct 17, 2024, via arXiv

Lawma: The Power of Specialization for Legal Tasks

Jul 23, 2024, via arXiv

Evaluating language models as risk scores

Jul 19, 2024, via arXiv

Training on the Test Task Confounds Evaluation and Emergence

Jul 10, 2024, via arXiv

Allocation Requires Prediction Only if Inequality Is Low

Jun 19, 2024, via arXiv